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1 Tablemaker output files 

Tablemaker outputs the following set of related tab-delimited text files. Tablemaker is de- 
signed to be run on the output of Cufflinks and Cuffmerge but Ballgown can be used with 
any assembly output that can be converted into the following sets of tab-delimited files. 

• e_data.ctab: exon-level expression measurements. One row per exon. Columns are 
eJd (numeric exon id), chr, strand, start, end (genomic location of the exon), and the 
following expression measurements for each sample: 

— rcount: reads overlapping the exon 

— ucount: uniquely mapped reads overlapping the exon 

— mrcount: multi-map-corrected number of reads overlapping the exon 

— cov: average per-base read coverage 

— cov_sd: standard deviation of per-base read coverage 

— mcov: multi-map-corrected average per-base read coverage 
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— mcovsd: standard deviation of multi-map-corrected per-base coverage 

• i_data.ctab: intron- (i.e., junction-) level expression measurements. One row per intron. 
Columns are iAd (numeric intron id), chr, strand, start, end (genomic location of the 
intron), and the following expression measurements for each sample: 

— rcount: number of reads supporting the intron 

— ucount: number of uniquely mapped reads supporting the intron 

— mrcount: multi-map-corrected number of reads supporting the intron 

• t_data.ctab: transcript-level expression measurements. One row per transcript. Columns 
are: 

— t-id: numeric transcript id 

— chr, strand, start, end: genomic location of the transcript 

— t_name: Cufflinks-generated transcript id 

— num_exons: number of exons comprising the transcript 

— length: transcript length, including both exons and introns 

— gene_id: gene the transcript belongs to 

— genejname: HUGO gene name for the transcript, if known 

— cov: per-base coverage for the transcript (available for each sample) 

— FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample) 

• e2t. ctab: table with two columns, eJd and tJd, denoting which exons belong to which 
transcripts. These ids match the ids in the e.data and t_data tables. 

• i2t.ctab: table with two columns, Lid and tJd, denoting which introns belong to which 
transcripts. These ids match the ids in the Ldata and t_data tables. 

2 The Ballgown pipeline and S4 class 

The Ballgown infrastructure is designed to be a tool-agnostic bridge between transcriptome 
assemblers and abundance estimation tools, and differential expression analysis pipelines in 
R and Bioconductor (Supplementary Figure [I]). In its current form, the software can be 
used with any assembly whose structure is specified in GTF format and a set of spliced read 
alignments in BAM format: Tablemaker can be used to estimate transcript-level FPKMs with 
Cufflinks , or RSEM (rsem-calculate-expression) can be used to estimate transcript- and 
gene-level TPMs and FPKMs. The Ballgown R package can automatically read in either 
Tablemaker or RSEM output, and any assembler outputting files in the Tablemaker format 
(Supplementary Section 1) can easily be incorporated into the pipeline. 
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Figure 1: Ballgown as a bridge between transcriptome assembly and fast, flexible 
differential expression analysis. The Ballgown workflow connects transcript assembly 
tools like Tophat2 and Cufflinks to Bioconductor tools like EdgeR and DESeq for downstream 
analysis. 

The Ballgown R package provides a comprehensive data structure for transcriptome as- 
semblies (Supplementary Figure [2J. We define an S4 class ballgown with slots for (a) 
expression data, including several measurements for transcripts, exons, and introns, (b) as- 
sembly structure, which utilizes the efficient GenomicRanges [15] data structure to store 
information about exons, introns, and transcripts, and (c) other relevant assembly data, 
including phenotype data, relationships between exons, introns, and transcripts, and paths 
to alignment files on disk for easy connection with the assembly. 

3 Data, notation, and statistical models 

There are two distinct components to the data that Ballgown is equipped to analyze: the 
actual structure of the assembled transcriptome: (1) genomic locations of features and the 
relationships between exons, introns, transcripts and (2) genes and the expression measure- 
ments for the features in the transcriptome. Here we precisely define both the assembly 
structure and the associated data. 

Assembly structure 

The transcriptome is assembled based on a set R of aligned RNA-seq reads. We denote the 
yth read from the zth sample with r yz , where y = 1, N z and z = 1, ...,n, so there are n 
samples in the study, and sample z has N z aligned reads. 
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Figure 2: The ballgown data structure. The Ballgown R package loads assembly data 
into an object with linked data frames of expression measurements for exons, introns, and 
transcripts. The object also loads information about exon, intron, and transcript structures 
and corresponding indexes for matching structures, expression measurements, and phenotype 
data. 

The transcriptome assembled from the reads consists of four types of features: transcripts, 
genes, exons, and introns. These features all have start and finishing positions on the genome, 
which represent using the functions s() and /(), e.g., s(x) represents the start position of 
feature x. The K assembled transcripts are denoted by where k = 1,...,K. These 
transcripts can be organized into G genes, denoted by gi, I = 1,...,G. Each gene can be 
represented by a set of transcripts falling within its boundaries: 

9i = {t k ■ s(t k ) > s(gi) and f(t k ) < f(g l )} 

The assembly also contains M exons, each of which we represent as a closed interval of 
genomic locations: 

e m = [s{e m ), f(e m )],m = 1, M 

With this notation, we can then represent transcript fcasa subset of the M exons comprising 
the assembly: 

t k = {e m ■ m E 4}, I k C {1, M} 

Here, 1^ represents the indices of the exons that make up transcript k. Note that the exon 
e m can belong to several different transcripts. We can then easily define s(tk) and /(tfc) in 
terms of exon boundaries: 

s(t fe ) = min{s(e m ) : m G 4} 
f(t k ) = max{/(e m ) : m € 4} 
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Finally, let Wk represent the wth element of Ik- Then we can denote the wth intron in 
transcript k with an open interval: 

ikw = (f(e Wk ),s(e {w+1)k )) 

In other words, ik w is simply the genomic interval between the wth and w + 1th exons of 
transcript k. 

With these definitions in place, we can now precisely define the reads r yz . An RNA-seq 
read is simply a subsequence of an RNA transcript. Using set notation, we can define each 
read using the form: 

r yz = S x E [E, E'\ : E < E' and x, E, E' 6 [^J e m for some k 

An assembly algorithm applied to the set of reads r yz produces estimates of the exons: 
e m , m = 1, . . . , M, transcripts: t k ,k = 1, ... K of the transcripts and genes: gi, I = 1, . . . , G. 
Most current statistical models treat this assembly as fixed and correct when performing 
analyses. But as we will demonstrate in the methods section, assembled transcripts are 
subject to error and may be improved through statistical analysis [T9l 126] . 



Expression data 

Next we can define expression measurements for each type of feature given a particular 
assembled set of transcripts. Here we define sensible expression measurements that are 
currently implemented in the Ballgown package, but the statistical models are flexible enough 
to handle other types of measures as well. 

For each sample z, each transcript has two measurements that are calculated by our 
upstream Ballgown preprocessing software: average per-base read coverage: cov(tk, z) and 
FPKM (fragments per kilobase of transcript per million mapped reads): FPKM(tk, z). 
Currently, these transcript-level measurements are estimated in Cufflinks via maximum like- 
lihood; the procedure is described in detail by [27] . 

Each gene g\ has one expression measurement for each sample, FPKM(gi, z). This 
measurement is reconstructed from the transcripts in gi as follows: first, the number of 
fragments per million mapped reads for sample z for each % G gi is calculated by multiplying 
FPKM(tk, z) by the length of transcript t& in kilobases. The gene's total fragments per 
million mapped reads is the sum of the transcript-level fragments per million mapped reads 
for all the transcripts in the gene. Finally, the gene-level FPKM is calculated by dividing 
the gene's total fragments per million mapped reads by the gene's length. 

The Ballgown preprocessor also calculates average per-base read coverage for each exon 
in the assembly, given the assembly structure and the aligned reads R. For sample z, we 
have: 

/(em)] 1 { bp G ^ 



cov(e m ,z) 



f(e m ) - s(e m ) + 1 
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Each exon also has a raw read count, defined as the number of reads whose alignments 
overlap that exon: 

rcount(e m , z) = } j l{r yz D e m ^ 0} 

Tyz&R 

The main expression measurement for introns is also raw read count, defined as the number 
of reads whose alignments support the intron in the sense that their alignments are split 
across that intron's neighboring exons: 

rcount(i kw ,z) = ^ l{s( r yz) c Cm and f{r yz ) G e m >} 

r yz &R 

where m < w k and m' > (w + 1)^. 

Statistical methods for detecting differential expression 

After exploring the structure of the assembled transcriptome and performing any necessary 
transcript post processing, the next step is to identify transcripts or genes that are differ- 
entially expressed across groups. Here we outline a framework for statistical analysis of 
transcript and gene abundances. To make the ideas concrete we use FPKM as the expres- 
sion measurement and transcripts as the feature of interest, but these can be replaced in the 
following model definitions with any of the expression measurements and any of the available 
genomic features in the assembly (genes, transcripts, exons, or introns). 

Differential expression tests are implemented as follows: for each transcript t k , the fol- 
lowing model is fit: 

p 

h{FPKM{t k , z)) =a k + J2 PpkXzv + e*k (!) 

P =i 

where: 

• FPKM(i k , z) is the FPKM expression measurement for transcript k for sample z 

• h is a transformation [3J to reduce the impact of mean-variance relationships observed 
in the counts |2j. For example, the transformation h(-) = log 2 (- + 1) is commonly 
applied in the analysis of sequence-count data [13] . 

• a k represents the baseline expression for transcript k 

• X zp represents covariate p for sample z. These covariates differ by experiment type. X z i 
generally represents a library size adjustment for sample z. Assuming c k represents 
the 75th percentile of all log FPKM values for transcript k, ballgown's default the 
covariate X\ z is: 

J2FPKM{t k ,z)l[FPKM{t k ,z) <= c k ] 

k 

This normalization term is derived from the "cumulative sum scaling" (CSS) normal- 
ization approach [20] . 
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• (3 pk quantifies the association of covariate p on the expression of transcript k 

• e represents residual measurement error 

A flexible approach to differential expression is to compare nested sub models of model ([!]) 
using parametric F-tests [23]. The null hypothesis can be as flexible as any linear contrast 
of the coefficients (3 p k but for simplicity we focus on null hypotheses of the form: H 0 : 
fipk = 0,p £ S versus the alternative that all (3 p k are nonzero. The general principle is 
that a model including any potential confounders plus the covariate(s) of interest - a 0/1 
indicator for group in the two-group comparison, several indicator variables for the multi- 
group comparison, or a generalized additive model [TT] for a time variable for timecourse 
experiments - is compared with a model that includes only the potential confounders. For 
the two models fit for each transcript k, Ballgown calculates the statistic 

RSSq-RSS! 

RSJh 
n-Pi 

where RSSq represents the residual sum of squares from the model without group or time 
covariates, RSS\ represents the residual sum of squares from the model including the co- 
variates of interest, Pq is the number of covariates in the smaller model, Pi is the number 
of covariates in the larger model, and n is the total number of samples. Under the null 
hypothesis that the larger model does not fit the data significantly better than the smaller 
model, this statistic follows an F distribution with (Pi — P 0 , n — Pi) degrees of freedom, 
so p- values can be generated by comparing the two models for each transcript k [16]. We 
control for multiple testing using standard FDR controlling procedures [24J. 



4 Methods for data analyses 
Processing the GEUVADIS data 



We downloaded the FASTQ files from the GEUVADIS project from http://www.ebi.ac. 
|uk/ena/data/view/ERP 001942. With this data, we: 

• Aligned reads with TopHat 2.0.9, using the -G option to align reads to the transcrip- 
tome first. We used the hgl9 genome reference available from the Illumina iGenomes 
project. 

• Assembled sample-specific transcriptomes with Cufflinks 2.1.1, using default options 
and no annotation 

• Merged sample-specific assemblies into an experiment-wide assembly with Cuffmerge 
2.1.1 

• Estimated feature expression and organized the assembly with Tablemaker so that all 
files described in Supplmentary Section 1 were available. 
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• Created several Ballgown objects using the Ballgown R package 

The resulting Ballgown objects include phenotype data available from several sources, 



including |http : //www. ebi . ac .uk/ena/data/view/ERP001942, the 1000 Genomes Project 



[5], and additional quality control data from GEUVADIS researchers (available at|https:// 


github . com/alyssaf razee/ballgown_code/blob/master/GEUVADIS_preprocessing/GD6 


QCstats . masterf ile . txt 


The Ballgown R objects are available for download at http: 


//f igshare . com/articles/GEUVADIS_Processed_Data/1130849 


So the objects can be 



feasibly loaded into memory and stored on disk, a separate object is available for each ex- 
pression measurement. 



InSilico DB Analyses 

InSilico DB [4J includes processed data from public experiments on the Sequence Read 
Archive. We downloaded the Cuffdiff2 output from the cancer versus normal and devel- 
opmental data sets from InSilico DB on March 5th, 2014. We extracted the p-values for 
differential expression for the cancer versus normal comparison [12] and the embryonic stem 
cells versus preimplantation blastomeres data. We also reformatted the FPKM values from 
this analysis and applied the linear models included in the Ballgown package to perform 
the comparison. The versions and parameters for the software used by InSilico DB were 
cufflinks, cuffmerge, cuffdiff: v 2.0.2, cufflinks -p 6 -q, tophat: v 2.0.4 -niate.inner.dist 80 
-no-coverage-search (personal communication Alain Coletta from the InSilico DB). 

In order to run the latest versions of TopHat, Cufflinks, and Cuffdiff2, we downloaded 
the raw sequencing reads from both experiments from the NCBI Sequence Read Archive 
[17] . The analysis steps were the same as the steps outlined for processing the GEUVADIS 
dataset in the previous subsection, except Tophat2 version 2.0.11 and Cufflinks version 2.2.1 
was used. In addition, there was a small change at the Cufflinks step: because the data sets 
in InSilico DB were created by estimating transcript abundances for pre-annotated isoforms, 
we did the same when we processed the data ourselves. This means we ran Cufflinks with the 
-G option and estimated FPKM values for Illumina's iGenomes annotated genes for hgl9. 
These are the isoforms considered in the analysis results. 

We analyzed data from an experiment comparing lung adenocarcinoma (n = 12) and 
normal control samples (n = 12) in nonsmoking female patients [12] and from an experiment 
comparing cells at five developmental stages [29]. Since Cuffdiff 2 was designed for two-class 
comparisons, we only compared expression between two developmental stages: embryonic 
stem cells (n = 34) and pre- implantation blastomeres (n = 78). We compared results 
from Cuffdiff 2 (versions 2.0.2 and 2.2.1), the linear models from Ballgown, the empirical 
Bayes linear modeling framework implemented in limma [23], and EBSeq [18] . a Bayesian 
framework designed for isoform-level differential expression. 
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Simulation Studies 

To ensure that the linear models implemented in Ballgown perform accurately, we performed 
two separate simulation studies. For both studies, reads were generated from 2745 annotated 
transcripts on Chromosome 22 from Ensembl [8J, using genome build GRCh37 and Ensembl 
version 74. Data was generated for 20 biological replicates, divided into two groups of 10, 
where 274 transcripts were randomly chosen to be differentially expressed (at a 6x increase 
in expression level) in one of the two groups, randomly chosen. 
The first simulation study was set up as follows: 

• Expression was measured in FPKM. Each transcript's baseline mean FPKM value 
was determined based on the distribution of mean FPKM values for highly-expressed 
transcripts in the GEUVADIS dataset. Specifically, the mean of all nonzero FPKM 
values was calculated for each transcript in the GEUVADIS dataset with mean FPKM 
larger than 100, and each isoform in the simulated dataset was assigned a randomly 
selected baseline mean FPKM from this distribution. 

• We defined a log-log relationship between a transcript's mean expression level and the 
variance of its expression levels: 

log variance = 2.23 log mean — 3.08 

This relationship was estimated empirically from the assembled GEUVADIS transcrip- 
tome (transcripts with mean FPKM values greater than 10) using simple linear regres- 
sion. The GEUVADIS dataset includes both biological and technical replicates, so this 
model should encompass both biological and technical variability. 

• Then, for each transcript, we randomly drew FPKM expression values from a log- 
normal distribution with the pre-set mean and variance. For the differentially expressed 
transcripts, the pre-set mean FPKM was 6 times larger in one group than in the other. 

• For each transcript, we also set a sample's expression level to 0 with probability po, 
which was estimated from the GEUVADIS data: for each simulated transcript, po was 
randomly drawn from the empirical distribution of the proportion of samples with zero 
expression, over transcripts in the GEUVADIS dataset with mean FPKM larger than 
100. 

• To translate the pre-set FPKM value into a number of reads to be generated from a 
transcript for a given sample, we used the definition of FPKM and calculated the num- 
ber of "fragments" (reads) that should be generated from a transcript by multiplying 
the set FPKM value by the transcript's length over 1000, then multiplying by an ap- 
proximate library size of 150,000 reads, over 1 million. The decision to use a mixture of 
two distributions (log-normal and point mass at 0) was informed by exploratory anal- 
ysis of the FPKM distributions among several transcripts in the GEUVADIS dataset. 
The exploratory analysis is available at |http : / /ht mlprev iew . github . io/?https : / /| 
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github . com/ alyssaf razee/ballgown_code/blob/master/ simulations/mean_var_relati 



html] 

This simulation setup made it such that more reads were generated from longer tran- 
scripts, as is expected with RNA-seq protocols. 

A second simulation was also conducted with a slightly simpler setup: 

• Expression was defined directly by the number of reads being generated from each 
transcript (instead of using FPKM). 

• The mean number of reads generated from each transcript was set to be 300, unless 
the transcript was randomly selected to be overexpressed in one group, in which case, 
that group's mean read number for that transcript was 1800. 

• The actual number of reads to be simulated from a transcript was drawn from a negative 
binomial distribution with mean jj, = 300 or 1800, and size equal to 0.005/i (so, 1.5 
for fi = 300 and 9 for fi = 1800). Note that in the negative binomial distribution, the 
variance is equal to /i + /i 2 /size. 

• Each sample's read counts were scaled and rounded such that approximately 600,000 
reads were generated per sample. 

For both these scenarios, the specified number of reads was then generated from tran- 
scripts using the polyester package. These simulated reads were then aligned to the genome 
using TopHat 2.0.11 (aligning to the annotated transcriptome first with the -G option), 
and the resulting alignments were used to assemble transcripts with Cufflinks 2.2.1. Cuffd- 
iff2 (2.2.1) was then run on the simulated datasets. For the Ballgown results, we used 
Tablemaker to organize the output, but because Tablemaker calls Cufflinks version 2.1.1 to 
estimate per-transcript FPKMs, we updated the ballgown object to use the FPKMs written 
in the isof orms . read_group_tracking file by Cuffdiff'2. 

The following models were fit for each transcript in each simulation scenario: 



H A : \og 2 (FPKM i + l) = P* + P* 1 grp i + r ] *q75 i + e* 
H 0 : ]og 2 (FPKM i + l)=p 0 +T } tf5 i + e i 

where grpi is the value of the group indicator for sample i and q75 is a library-size 
normalizing constant equal to the sum of the log of the nonzero FPKM values to the 75th 
percentile (known as "cumulative sum scaling" normalization; [20J ) . We then tested the 
hypothesis Ho : /3* = 0 versus the alternative that the coefficient was non-zero. For the 
analysis with average coverage we replaced FPKMi with acovi in the above equations. 

We performed simulation studies to precisely assess the accuracy of the differential ex- 
pression methods. However, assessing the accuracy of transcript-level differential expression 
is complicated because the annotated transcripts from which reads were generated do not 
exactly match the assembled transcripts which were tested for differential. This means there 
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is no standard way to define which assembled transcripts should be called differentially ex- 
pressed. In our accuracy assessments (Supplementary Figure [8]), we chose to identify the 
three closest assembled "neighbors" for each of the 274 truly DE annotated transcripts. Dis- 
tance was measured by percent overlap, so each annotated transcript's 3 closest assembled 
neighbors were the 3 transcripts overlapping it the most. All of these selected "neighbors" 
were considered as part of the sensitivity and specificity calculations: sensitivity was defined 
as the ratio of the number of truly differentially expressed annotated transcripts with at 
least one of its three closest assembled neighbors called differentially expressed to the total 
number of truly differentially expressed annotated transcripts. Specificity was defined as 
percentage of "non-neighbor" assembled transcripts that were correctly called not differen- 
tially expressed, where "non-neighbor" means the assembled transcript was not one of the 
three closest to an annotated transcript set to be differentially expressed. 

RIN Analysis 

We filtered to the 464 unique replicates in the GEUVADIS study pQ as indicated in the 
quality control data, and we analyzed only transcripts with FPKM > 0.1. We first searched 
for differential expression with respect to RNA quality (RIN) using the following set of nested 
linear models to each transcript. 

4 5 

H A : Xog^FPKMi + l)=/? 0 * + £ 0* t splme t (RINi) + ^ l* p l{Pop % = p) + r/*g75 4 + e* 

t=i p=i 

5 

H 0 : log 2 {FPKM i + l)=(3 0 + Y,"fpHPop i =p) + r ] q75 i + e i 

P =i 

Here i indicates sample and the subscript for transcript has been suppressed for clarity. 
Ho denotes the null model and Ha denotes the alternative. The first set of terms encode a 
natural cubic spline fit with 4 degrees of freedom between the RIN values and the FPKM 
levels; the term spline^ .R/iVj) refers to the tth B-spline basis term for sample i. The second 
set of terms encode a factor model for the relationship between population and FPKM and 
the last term is a library size normalization term that consists of the sum of log of the 
the non-zero FPKMs up to the 75th percentile for that sample ("cumulative sum scaling" 
normalization; [2D])- We then tested the hypothesis that H 0 : (3i = (3 2 = (3 3 = 0 versus the 
alternative that at least one coefficient was non-zero. All transcripts with a q-value [21] less 
than 0.05 were called significant. 

Next we attempted to identify transcripts that were significantly better explained by 
a non-linear polynomial fit, rather than a linear trend. We fit the following nested set of 
models: 
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3 5 

H A : \og 2 (FPKM i + l) = /3* + J2/3* t RIN t l +J2% 1 ( P °P^=P)+V*q75 i + e* (2) 

t=i p=i 

5 

if 0 : log 2 (FPKM i + l) = /3 0 + /3i J R/iV i + ^7 P l(^=p)+w75 i + e i (3) 

p=i 

and tested the hypothesis that ifo : fii — 03 — 0 versus the alternative that at least one 
of the higher order polynomial coefficients was nonzero. Again, all transcripts with a q-value 
[21] less than 0.05 were called significant. 

The transcripts in the figure were statistically significant at the FDR 5% level for this 
second analysis. In the plots, the curves represent the fitted values for the average library size 
within each population. We show one example each of a positive and negative relationship 
between expression and RIN. While there were several examples of associations in both 
directions, there were more positive associations, as expected (Supplementary Figure [3]). 



t-statistics for linear RIN coefficients 




-20 -10 0 10 

t-statistics 



Figure 3: Distribution of t-statistics for the linear RIN term for GEUVADIS 
transcripts. These are moderated t-statistics calculated with limma for the /3i coefficient in 
model (3), indicating directionality of the RIN-FPKM relationship. We observe associations 
in both directions, but as expected, there are more positive associations. 

eQTL Analysis 

We downloaded genotype information for the GEUVADIS cohort from |f tp : //f tp . ebi . ac . 
uk/pub/databases/microarray/data/experiment/GEUV/E-GEUV-l/genotypes/. We fil- 
tered to only SNPs with a minor allele frequency greater than 5%. We used the processed 
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transcriptome data from Tablemaker as described above. We removed samples that were 
sequenced multiple times according to the protocol described by GEUVADIS [I]. We cal- 
culated the first three principal components of the genotype data using the Plink software 
[21]. We filtered to transcripts with an average FPKM > 0.1 and took the log2 transform 
of the FPKM values. We then used the MatrixEQTL package [22j to perform the eQTL 
analysis testing an additive linear regression model for the SNPs adjusting for three expres- 
sion principal components and three genotype principal components. We filtered to only 
transcript-SNP pairs that were no more than 1000Kb apart. 

We recorded the histogram of p-values from all transcript-SNP pairs. We calculated an 
estimate of the fraction of null hypotheses based on the distribution of observed p-values [21] 
and obtained an estimate of itq = 0.942. The p-value histogram (Figure^) and QQ-plot of 
-loglO (p-values) (Figure |2}d) versus their theoretical distribution under the null do not show 
any gross deviation suggesting unmodeled confounding [7]. 




Figure 4: Distribution of statistical significance scores for all cis-eQTL tests a. P- 

value histogram for all p-values from cis-eQTL tests, the estimated fraction of null hypotheses 
is 94.2%. b. QQ-plot of -loglO (p-values) versus theoretical quantiles shows no gross deviation 
from expected behavior. 

For the transcript overlap analysis, we downloaded the list of significant cis-eQTL from 
|f tp : //f tp . ebi . ac . uk/ pub/ databases/microarray/data/ experiment/GEUV/E-GEUV- 1/genotype 
for the EUR and YRI populations. We identified all Ensembl genes overlapped to any degree 
by each assembled transcript. We then calculated the number of gene-SNP pairs in common 
between the GEUVADIS EUR and YRI analyses and our eQTL analysis. 
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5 Additional results: statistical significance 



5.1 Negative and positive control experiments 

The main manuscript describes two experiments we designed. The first was a negative 
control experiment, performed to demonstrate that the default methods in Ballgown perform 
appropriately in a scenario where there is no differential expression signal. Subjects in the 
FIN population group in the processed GEUVADIS dataset were randomly assigned to one 
of two groups, and all assembled transcripts for differential expression between those two 
groups. Linear models as implemented in Ballgown gave uniformly distributed null p-values, 
as expected (Figure la, main manuscript). However, the statistical results from Cuffdiff2 
(version 2.2.1, the newest release available as of August 2014) on the same dataset, gave p- 
values that were not uniformly distributed but instead were biased toward 1 (Supplementary 
Figure |5^,). At the exon level, the p- value distribution from EdgeR was also not uniform, 
having a bit of extra mass around 0.1 (Supplementary Figure^). These results show that 
a well-established, count-based methods gives a slightly too-liberal result on this kind of 
experiment and illustrates a potential conservative bias still present in Cuffdiff2 version 
2.2.1. 



(a) 



Cuffdiff negative control p-values 



(b) 



EdgeR negative control p-values 
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Figure 5: P-value histograms of results of differential expression analyses between two ran- 
domly selected groups, a. Transcript-level results from Cuffdiff2 version 2.2.1, showing a 
conservative bias. b. Exon-level results from EdgeR, showing a liberal bias. 



The positive control experiment in the main manuscript tested transcripts on the Y 
chromosome for differential expression between males and females in the FIN population 
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in the processed GEUVADIS dataset. Previous research [9] has shown earlier versions of 
Cuffdiff2 performing poorly on this type of experiment, but version 2.2.1 discovers some 
statistically significant differences in expression on the Y chromosome between males and 
females. Most of the p- values from the 29 tested transcripts were low (Figure [6]). However, 
433 transcripts were assembled, so Cuffdiff2 2.2.1 seems to be a bit conservative in this 
regard about when it should perform a test. 



Cuffdiff2: Y chromosome 




I I I I I I 

0.00 0.05 0.10 0.15 0.20 0.25 

p-values 



Figure 6: P-value distribution from Cuffdiff2 2.2.1, comparing expression of transcripts on 
the Y chromosome between males and females. 29 of 433 assembled transcripts were tested 



5.2 Differential expression analyses for cancer and cell type datasets 

Transcript-level differential expression analysis comparing lung adenocarcinoma (n = 12) and 
normal cells (n = 12), and comparing embryonic stem cells (n = 34) to pre-implantation 
blastomeres (n = 78), should show a strong differential expression signal, especially consider- 
ing the sample sizes for these experiments. In the cancer vs. normal comparison, there were 
19,748 transcripts with average FPKM greater than 1. Cuffdiff2 (version 2.2.1) identified 
4608 of these transcripts as differentially expressed (q < 0.05). F-tests comparing nested lin- 
ear models, as implemented in Ballgown, flagged 8875 of these highly-expressed transcripts 
as differentially expressed. Of 27,058 transcripts tested, EBSeq called 8736 differentially 
expressed (posterior probability of differential expression of at least 0.95). Similarly, in the 
cell type dataset, there were 16,430 transcripts with mean FPKM greater than 1. Cuffdiff'2 
(2.2.1) calls 6816 of these differentially expressed (q < 0.05) while Ballgown calls 9701 of 
them differentially expressed. And of 15,462 transcripts tested, EBSeq identifies 10,307 with 
posterior probabilities of differential expression of at least 0.95. 



15 



While both linear modeling and Cuffdiff2 produced reasonable p-value distributions for 
these experiments (Supplementary Figure uk,c), the relative numbers of differentially ex- 
pressed transcripts discovered and the p-value distribution shapes show that CuffdiffB is 
more conservative than the linear models. On its own, this result does not necessarily 
mean that Cuffdiff'2 (2.2.1) is too conservative, but Cuffdiff2 also produced conservative 
p-value distributions in the negative and positive control experiments (main manuscript), 
we have prior knowledge that the differential expression signal should be quite strong in a 
tumor/normal or a cell type comparison, and the numbers of discoveries made by another 
published differential expression method (EBSeq) align more closely with the results from 
the linear model comparisons. Together these results indicate that conservative bias per- 
sists in Cuffdiff2 (2.2.1). Past versions of Cuffdiff2, particularly 2.0.2, produced extremely 
conservative results on these datasets (Supplementary Figure [7}3,d), calling 1 of 4454 highly- 
expressed transcripts in the tumor/normal dataset differentially expressed, while Ballgown's 
linear models identified 774. Similarly, in the cell type dataset, Cuffdiff'2 2.0.2 found 0 of 
12,469 highly-expressed transcripts to be differentially expressed (q < 0.05) between embry- 
onic stem cells and preimplantation blastomeres, while the linear model tests in Ballgown 
found 6964. 

These results in large-scale studies suggest that Cuffdiff2's statistical significance esti- 
mates were strongly conservatively biased in version 2.0.2. While version 2.2.1 is better, 
Cuffdiff2 is still not producing uniformly-distributed null p-values and is more conservative 
than other differential expression methods. 

5.3 Simulation studies 

The results for the first simulation, where the differential expression was set at the FPKM 
level, were that Cuffdiff (2.2.1) showed the same conservative bias we observed in the neg- 
ative control experiment and possibly in the InSilico DB experiments. Using the g-value as 
a significance cutoff, Cuffdiff'2 called 1 transcript differentially expressed (controlling FDR 
at the 5% level), compared to 56 using Ballgown 's F-test (Supplementary Section 3). Ac- 
cordingly, the p-value distributions showed similar patterns to those we observed in the 
adenocarcinoma and developmental cell datasets (Figure [8^). While the accuracy of the 
transcript rankings was comparable for both methods - for the linear models in Ballgown, 
88 of the top 100 transcripts called differentially expressed were truly differentially expressed 
for Ballgown versus 85 for Cuffdiff '2.2.1 - an ROC curve based on q-value cutoff shows Ball- 
gown outperforming Cuffdiff 2 in terms of sensitivity and specificity (Figure |8}d). 

We hypothesize that transcript length normalization may have something to do with the 
problems observed in Cuff'diff'2 , s statistical significance estimation, because in the second 
simulation scenario, where the number of reads sampled from each transcript was indepen- 
dent of its length, Cuffdiff2 performed comparably to the linear model framework included 
in Ballgown, and both seemed to be performing accurately. The p-value histograms for both 
methods showed uniformly-distributed p-values at the high end and some signal at the low 
end, as expected (Figure [8^), and the ROC curves are approximately equivalent; both dis- 
play high sensitivity and specificity (Figure |8tl) . In this scenario, of the top 100 transcripts 
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Figure 7: Comparison of statistical significance estimates between Cuffdiff2 and 
linear models in real datasets a. Histograms of p-values from a comparison of 12 lung 
adenocarcinomas and 12 normal controls from female patients who never smoked. Ballgown 
in blue, Cuffdiff2 (2.2.1) in orange, b. Same comparison as in panel (a), but using the 
Cuffdiff2 version 2.0.2 results available from InSilico DB. Cuffdiff2 version 2.0.2 had a strong 
conservative bias. Linear model results from Ballgown differ from panel (a) because the 
FPKM estimates used were from an older version of Cufflinks, though the linear model results 
do not demonstrate conservative bias. c. Histograms of p-values from the comparison of 78 
pre-implantation blastomere samples and 34 embryonic stem cell samples {Ballgown in blue, 
Cuffdiff2 (2.2.1) in orange), d. Same comparison as in panel (c), but using the Cuffdiff'2 
version 2.0.2 results available from InSilico DB. As in panel (b), Cuffdiff 2 2.0.2 showed a 
strong conservative bias. 
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ranked by each method, 96 are truly differentially expressed for Cuffdiff2 and 90 are for 
the linear models implemented in Ballgown. This shows that the models impelemented in 
Cuffdiff2 are accurate under some conditions - e.g., when the number of sequencing reads 
from a transcript is unrelated to its length - that may be somewhat unrealistic. 

6 Comparison of average coverage and FPKM as ex- 
pression measurements in differential expression stud- 
ies 

We compared the previous differential expression results, which were based on measuring 
transcript abundance using FPKM, with analyses using average per-base read coverage as 
the transcript expression measurement instead. Doing this comparison was straightforward, 
since Tablemaker outputs FPKM estimates from Cufflinks along with a variety of other 
expression measurements for the features of a transcriptome. We first did a comparsion be- 
tween FPKM and average coverage using First, we used our simulated dataset to investigate 
the impact of using average coverage as the transcript expression measurement, compared to 
using FPKM, as was done in our previous analyses. To do this comparison, we re-analyzed 
the data we simulated in the scenario where differential expression occurred at the FPKM 
level (Supplementary Section 4, "Simulation studies", first simulation; Supplementary Figure 
[8^,-b) but used average coverage as the transcript-level expression measurement. The dif- 
ferential expression rankings measuring transcript expression with FPKM and with average 
coverage were highly correlated (Figure^), with a correlation coefficient of 0.66. The p- 
value distribution using average coverage (Figure ^jjp) was similar to the p- value distribution 
using FPKM (Figure [8^,), though only 25 transcripts were found to be differentially expressed 
(q < 0.05), compared to 56 using FPKM. We also observed correlated ranks (r = 0.57) be- 
tween the differential expression results testing whether RIN value affected expression in the 
GEUVADIS dataset (Figure [9^). These results confirm that in differential expression anal- 
yses, count-based and FPKM-based (length-normalized) expression measurements perform 
similarly. Ballgown allows users to perform analyses with whatever expression measurements 
are available for their transcriptome, so for example, RSEM [TS] users can use such as tran- 
scripts per million (TPM) [2B] as an expression measurement. The framework also facilitates 
easy exploration of the different measurement options. 



7 Computational efficiency and timing results 

Next we investigated the computational efficiency of our approach compared to the standard 
Cufflinks pipeline. Tophat and Cufflinks can be run on each sample separately, but Cuffdiff2 
must be run on all samples simultaneously. While Cuffdiff2 can make use of many cores on 
a single computer, is not parallelizable across computers. It has been noted that Cuffdiff2 
can take weeks or longer to run on experiments with a few hundred samples. This issue 
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(c) 



Simulated dataset (Negative Binomial) 
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ROC curve: negative binomial simulation 
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Figure 8: Comparison of statistical significance between Cuffdiff2 and linear models 

in Ballgown in simulated datasets a. Histograms of p-values from a simulated data 
set of 2,745 transcripts where differential expression was induced between 10 cases and 10 
controls in 10% of transcripts at the FPKM level (Ballgown in blue, Cuffdiff2 in orange), b. 
ROC curve comparing the abilities of Cuffdiff2 and linear modeling to identify differentially 
expressed transcripts in the FPKM simulation based on q-value. c. Histograms of p-values 
from a simulated data set of 2,745 transcripts in 10 cases and 10 controls, where 10% of 
transcripts were simulated to be differentially expressed, but the number of reads generated 
from each transcript was independent of trakScript length, d. ROC curve comparing the 
abilities of Cuffdiff2 and linear modeling to identify differentially expressed transcripts in 
the transcript-length-independent simulation study. 




Figure 9: Using average per-base coverage as transcript expression measurement 
instead of FPKM. a. Differential expression ranks for transcripts in a case/control sim- 
ulation (n = 10 per group), using FPKM as the expression measurement (x-axis) vs. using 
average coverage (y-axis). b. Distribution of p- values from differential expression tests be- 
tween the 10 cases and 10 controls, using average coverage as the expression measurement. 
This distribution is very similar to the distribution observed when using FPKM as the ex- 
pression measurement (Figure[8^). c. Rankings of the effect of RIN on transcript expression 
in the GEUVADIS dataset, using FPKM as the transcript expression measurement (x-axis) 
vs. using average coverage (y-axis). For visibility, 2000 transcripts were randomly sampled 
from the dataset for the plot. on 
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Figure 10: Timing results for the 667 GEUVADIS samples at each stage of the 
pipeline, a. Timing (in hours) for each sample to run through TopHat2. b. Timing (in 
hours) for each sample to run through Cufflinks, c. Timing (in hours) for each sample to 
run through Tablemaker. 



has led consortia and other groups to rely on unpublished software for transcript abundance 
estimation [U [5]. 

We compared each component of the pipeline in terms of computational time on one of our 
simulated dataset (the second, simpler scenario) with 20 samples and 2,745 transcripts. The 
Tophat2 - Cufflinks - Tablemaker- Ballgown pipeline was fastest, taking about 5.4 minutes per 
sample for Tablemaker, 2.3 seconds to load transcript data into R and less than 0.1 seconds 
for differential expression analysis. This is faster than the recently published Tophat2 - 
Cufflinks - Cuff quant -Cuffdiff2 pipeline [25], which required about 3 minutes per sample for 
Cuffquant and 19 minutes for differential expression analysis with Cuffdiff2. The Ballgown 
-Tablemaker pipeline was also substantially faster than directly running Cufflinks -Cuffdiff2 
, where the Cuffdiff2 step took about 68 minutes. For all these pipelines, Tophat2 took 
about 1 hour per sample and Cufflinks about 2 minutes per sample. All possible multicore 
processes ( TopHat, Cufflinks, Cuffdiff2, Cuffquant, Tablemaker) were run on 4 cores. 

We also calculated the per-sample distribution of processing times for each step in the 
Tophat2 - Cufflinks - Tablemaker pipeline for all 667 samples in the GEUVADIS study 



[T4] (Figure 10). Tablemaker took a median of 0.97 hours per sample (IQR 0.24 hours) 
on a standard 4 core computer; this calculation can be parallelized across samples. By 
contrast, Cuffdiff2 would take months to perform this analysis on a standard 4 core computer. 
Ballgown multiclass differential expression analysis between the CEU (n = 162), YRI (n = 
163), FIN (n = 114), GBR (n = 115) and TSI (n = 93) samples for 334,206 transcripts took 
42 minutes on a single core Desktop computer. 
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8 Software 



1. Ballgown - Available from Bioconductor [10] |http : //www . bioconductor . org/packages/ 
devel/bioc/html/ballgown.html Installation instructions and tutorial for use are 



available in the package vignette, and at https : //github . com/alyssaf razee/ballgown 



2. Tablemaker - Linux binary: lhttpT77figshare.com/articles/Tablemaker_Linux_ 
Binary/1053137; Mac OS X binary: lhttpT77figshare.com/articles/Tablemaker_ 



0S_X_Binary/1053136; source code/installation instructions: https://github.com/ 



alyssaf razee/tablemaker 



3. polyester - https : //github . com/alyssaf razee/ballgown/tree/master/polyester 



9 Scripts and Data 



Scripts and links to all data are available at: https : //github . com/alyssaf razee/ballgown_ 

|code/| 
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